Abstract. Counting the number of triangles in a graph has many important applications in network analysis. Several frequently computed metrics, such as the clustering coefficient and the transitivity ratio, require counting the triangles in the network. Furthermore, triangles are one of the most important graph classes considered in network mining. In this paper, we present a new randomized algorithm for approximate triangle counting. The algorithm can be combined with different sampling methods, each of which yields an effective triangle counting method. In particular, we present two sampling methods, called the q-optimal sampling and the edge sampling, which respectively give $O(sm)$ and $O(sn)$ time algorithms with provable error bounds ($m$ and $n$ are respectively the number of edges and vertices in the graph and $s$ is the number of samples). Among others, we show, for example, that if an upper bound $\tilde{\Delta}^e$ is known for the number of triangles incident to every edge, the proposed method provides a $1 \pm \epsilon$ approximation which runs in $O\left(\frac{\tilde{\Delta}^e n \log n}{\bar{\Delta}^e \epsilon^2}\right)$ time, where $\bar{\Delta}^e$ is the average number of triangles incident to an edge. Finally, we show that the algorithm can be adapted to streams. In that case it performs, for example, 2 passes over the data (3 passes if the size of the graph is not known in advance) and uses $O(sn)$ space.

1 Introduction

Graphs are fundamental structures for modeling complex relationships between data. Examples include the Internet (where vertices are routers and edges correspond to physical links), the World Wide Web (where vertices are web pages and edges correspond to hyperlinks), social networks (where vertices are people and edges correspond to friendships), traffic data (where vertices are places or cities and edges correspond to roads) and biological networks (where vertices are proteins and edges correspond to protein interactions).

The problem of counting subgraphs of a certain class is one of the typical problems in graph mining and has received considerable attention in recent years. Triangles are one of the most important basic subgraphs. Moreover, the computation of several network indices and statistics is based on counting the number of triangles, which makes triangle counting an essential problem in network analysis. The clustering coefficient of a graph [18] is defined as the normalized sum of the fraction of neighbor pairs of a vertex of the graph that are connected. The transitivity coefficient of a graph [11] is defined as the ratio between three times the number of triangles and the number of length-two paths in the graph.
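
As a small illustration of the transitivity coefficient defined above, the following sketch computes it by brute force for a graph given as a 0/1 adjacency matrix. The function names `count_triangles` and `transitivity` are chosen here for illustration only, and the cubic-time loop is meant for tiny examples rather than for real networks.

```python
def count_triangles(adj):
    """Brute-force triangle count for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    # Every triangle is counted 6 times (3 starting vertices x 2 directions).
    closed_walks = sum(adj[i][j] * adj[j][d] * adj[d][i]
                       for i in range(n) for j in range(n) for d in range(n))
    return closed_walks // 6


def transitivity(adj):
    """3 * (#triangles) / (#paths of length two); 0.0 if there are no such paths."""
    degrees = [sum(row) for row in adj]
    # A vertex of degree k is the middle vertex of k*(k-1)/2 length-two paths.
    two_paths = sum(k * (k - 1) // 2 for k in degrees)
    return 3 * count_triangles(adj) / two_paths if two_paths else 0.0
```

For a single triangle the ratio is 1, and for a path on three vertices it is 0, which matches the intuition that transitivity measures how often a length-two path is closed into a triangle.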

In terms of time complexity, the most efficient triangle counting algorithms are based on matrix multiplication. If $A$ is the adjacency matrix of a graph $G$, the number of triangles in $G$ is equal to

$$\frac{1}{6}\mathrm{Tr}(A^3) \qquad (1)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, defined as the sum of the elements on the main diagonal of the matrix.
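
The trace identity can be checked directly on a small graph. The sketch below uses plain nested lists and a naive cubic matrix product (the helper names `mat_mul` and `count_triangles_exact` are illustrative, not part of the paper); it is not one of the fast matrix multiplication algorithms discussed next.

```python
def mat_mul(x, y):
    """Naive product of two n x n matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles_exact(adj):
    """Number of triangles as Tr(A^3) / 6."""
    cube = mat_mul(mat_mul(adj, adj), adj)
    trace = sum(cube[i][i] for i in range(len(adj)))
    return trace // 6


# Complete graph on 4 vertices: every 3-subset of vertices is a triangle.
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(count_triangles_exact(k4))  # 4
```

The division by 6 reflects that each triangle contributes six closed walks of length three: three choices of starting vertex times two directions.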

The time complexity of the most efficient known algorithm for matrix multiplication is $O(n^{2.3727})$ [6], [19], where $n$ is the number of vertices of the graph (the number of rows/columns of $A$). The exponent of $n$, denoted by $\omega$, is called the matrix multiplication exponent. Alon et al. give in [1] a more efficient triangle counting algorithm for sparse graphs. The time complexity of their method is $O(m^{2\omega/(\omega+1)})$, where $m$ is the number of edges of the graph. For $\omega = 2.3727$, this time complexity is equal to $O(m^{1.41})$.

However, exact triangle counting methods may be inefficient when the graph is large. In these cases, an approximate algorithm is preferred, at the cost of losing the exact number of triangles. In recent years, many algorithms have been proposed for approximate triangle counting. A widely used technique is the sparsification technique
IfPTll . IfTBll . |[T4l and |[T3ll . In this technique, the graph is converted into a sparse graph 
and the number of triangles in the sparsified graph is counted. Then, the result is scaled to the original graph. Some other methods are based on approximate computation of algebraic properties of the graph, like eigenvalues of the adjacency matrix [15]. However, well-known methods for computing (or approximating) eigenvalues are heavily based on matrix multiplication, and it is known that the worst-case time complexity of computing eigenvalues is the same as that of matrix multiplication. Moreover, such approximate triangle counting algorithms treat the algebraic methods (like approximate matrix multiplication and low-rank matrix approximation) as a black box. Sampling in these algebraic methods is designed to minimize the element-wise error or the Frobenius norm of the error matrix, and therefore the error of triangle counting is not minimized.

In this paper, we propose a new randomized algorithm for approximate triangle counting. Our method is a variation of several approximate matrix multiplication algorithms [8], [10] and [9]. However, it does not generate any product matrix, and several algorithmic aspects are different. Furthermore, the sampling methods, which are crucial elements of the algorithm, are also different. The proposed algorithm can be seen as a general framework to which different sampling methods can be applied. Every sampling method gives a new triangle counting algorithm with its own error bounds and time complexity. In particular, we present two sampling methods, called the q-optimal sampling and the edge sampling, which respectively give $O(sm)$ and $O(sn)$ time algorithms with provable error bounds ($m$ and $n$ are respectively the number of edges and vertices in the graph and $s$ is the number of samples). Among others, we show, for example, that if an upper bound $\tilde{\Delta}^e$ is known for the number of triangles incident to every edge, the proposed method provides a $1 \pm \epsilon$ approximation which runs in $O\left(\frac{\tilde{\Delta}^e n \log n}{\bar{\Delta}^e \epsilon^2}\right)$ time, where $\bar{\Delta}^e$ is the average number of triangles incident to an edge. As we will discuss, some existing algorithms can be seen as adaptations of the proposed algorithm with specific sampling methods. We finally show that the algorithm can be extended to streams. In that case it performs, for example, 2 passes over the data (3 passes if the size of the graph is not known in advance) and uses $O(sn)$ memory cells.

The rest of this paper is organized as follows. In Section 2, we present a new randomized algorithm for approximate triangle counting. In Section 3, different sampling methods are introduced. In Section 4, we extend the algorithm to counting triangles in streams. An overview of related work is provided in Section 5. Finally, the paper is concluded in Section 6.

Throughout the paper, $G$ refers to a simple (i.e., loop-free and without multiple edges) and undirected graph. $A$ refers to the adjacency matrix of $G$. Therefore, $A$ is a square matrix consisting of 0s and 1s. $A_{ij}$ denotes the element in the $i$-th row and the $j$-th column of $A$. $n$ and $m$ denote the number of vertices of $G$ (the number of rows and the number of columns of $A$) and the number of edges of $G$, respectively. $\Delta$ denotes the number of triangles in $G$. In this paper, we use an index $i$ to refer to a row (column) of the adjacency matrix as well as to the vertex of the graph that corresponds to row (column) $i$.



2 The approximate triangle counting algorithm 

In this section, we present a randomized algorithm for approximate triangle counting. Suppose $A$ is an $n \times n$ matrix which is the adjacency matrix of a graph $G$. As Equation 1 shows, in order to count triangles in $G$, we can compute the trace of $A^3$. Algorithm 1 shows the high-level pseudocode of an approximate triangle counting algorithm, based on randomized calculation of the trace of $A^3$.

In every iteration $t$ of the loop in Lines 3-10 of Algorithm 1, first the following probabilities are computed:

$$p_1, p_2, \ldots, p_n \ge 0 \quad \text{such that} \quad \sum_{i=1}^{n} p_i = 1$$

Then, an $i \in \{1, \ldots, n\}$ is selected with probability $p_i$, and the following probabilities are computed:

$$q_{1|i}, \ldots, q_{n|i} \ge 0 \quad \text{such that} \quad \sum_{j=1}^{n} q_{j|i} = 1$$

Finally, a $j \in \{1, \ldots, n\}$ is selected with probability $q_{j|i}$ and the number of triangles is estimated by

$$\beta_t = \frac{\sum_{d=1}^{n} A_{di} A_{ij} A_{jd}}{6\, p_i\, q_{j|i}} \qquad (2)$$

The final estimation, $\beta$, is the average of the estimations of the different trials.
In the rest of this section, we study some important properties of Algorithm 1.



Algorithm 1 High-level pseudocode of the approximate triangle counting algorithm.

TriangleCounter
Require: A graph G.
Ensure: The approximate number of triangles in G.
 1: {Let A be the n × n adjacency matrix of G}
 2: β ← 0
 3: for t = 1 to s do
 4:   Compute p_1, ..., p_n
 5:   Select i ∈ {1, ..., n} with probability p_i
 6:   Compute q_{1|i}, ..., q_{n|i}
 7:   Select j ∈ {1, ..., n} with probability q_{j|i}
 8:   β_t ← (Σ_{d=1}^{n} A_{di} A_{ij} A_{jd}) / (6 p_i q_{j|i})
 9:   β ← β + β_t
10: end for
11: β ← β / s
12: return β
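
The pseudocode above can be sketched in Python as follows. This is only an illustrative rendering under some assumptions: the sampling strategy is passed in as two hypothetical callbacks (`vertex_probs` and `cond_probs`, names not taken from the paper), the graph is a dense 0/1 matrix, and the two example strategies shown (uniform vertex, then uniform neighbour) are just one possible choice.

```python
import random


def approx_triangles(adj, s, vertex_probs, cond_probs, rng=None):
    """Average of s independent estimates beta_t.

    adj          -- symmetric 0/1 adjacency matrix (list of lists)
    vertex_probs -- hypothetical callback: adj -> [p_1, ..., p_n]
    cond_probs   -- hypothetical callback: (adj, i) -> [q_{1|i}, ..., q_{n|i}]
    """
    rng = rng or random.Random()
    n = len(adj)
    total = 0.0
    for _ in range(s):
        p = vertex_probs(adj)
        i = rng.choices(range(n), weights=p)[0]
        q = cond_probs(adj, i)
        j = rng.choices(range(n), weights=q)[0]
        # sum over d of A_di * A_ij * A_jd, an O(n) scan
        walks = sum(adj[d][i] * adj[i][j] * adj[j][d] for d in range(n))
        total += walks / (6.0 * p[i] * q[j])
    return total / s


def uniform_vertex(adj):
    n = len(adj)
    return [1.0 / n] * n


def uniform_neighbor(adj, i):
    deg = sum(adj[i])
    if deg == 0:
        # Isolated vertex: any distribution works, the estimate is 0 anyway.
        return [1.0 / len(adj)] * len(adj)
    return [a / deg for a in adj[i]]
```

On a complete graph with 4 vertices every sampled pair gives the same estimate, so `approx_triangles(k4, 10, uniform_vertex, uniform_neighbor)` returns 4 up to rounding; on less regular graphs the result fluctuates around the true count.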



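Lemma 1. In Algorithm 1, for every $t \in \{1, \ldots, s\}$ we have:

$$E(\beta_t) = E(\beta) = \frac{1}{6}\mathrm{Tr}(A^3) \qquad (3)$$

Proof. For every $t \in \{1, \ldots, s\}$, assuming that $p_i\, q_{j|i} > 0$ for every pair $(i, j)$ with $\sum_{d=1}^{n} A_{di}A_{ij}A_{jd} > 0$, we have:

$$E(\beta_t) = \sum_{i=1}^{n}\sum_{j=1}^{n} p_i\, q_{j|i}\, \frac{\sum_{d=1}^{n} A_{di}A_{ij}A_{jd}}{6\, p_i\, q_{j|i}} = \frac{1}{6}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{d=1}^{n} A_{di}A_{ij}A_{jd} = \frac{1}{6}\mathrm{Tr}(A^3)$$

On the other hand, the $\beta_t$'s are independent random variables and $\beta$ is the sum of the $s$ independent random variables $\beta_1, \ldots, \beta_s$ divided by $s$. Therefore

$$E(\beta) = \frac{1}{s}\sum_{t=1}^{s} E(\beta_t) = \frac{1}{6}\mathrm{Tr}(A^3)$$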

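Lemma 2. In Algorithm 1, the variance of every random variable $\beta_t$ is

$$\mathrm{Var}(\beta_t) = \frac{1}{36}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\left(\sum_{d=1}^{n} A_{di}A_{ij}A_{jd}\right)^2}{p_i\, q_{j|i}} - \left(\mathrm{Tr}(A^3)\right)^2\right)$$

Proof. We have:

$$\mathrm{Var}(\beta_t) = E(\beta_t^2) - \left(E(\beta_t)\right)^2 \qquad (4)$$

Then

$$E(\beta_t^2) = \sum_{i=1}^{n}\sum_{j=1}^{n} p_i\, q_{j|i}\, \frac{\left(\sum_{d=1}^{n} A_{di}A_{ij}A_{jd}\right)^2}{36\, p_i^2\, q_{j|i}^2} = \frac{1}{36}\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\left(\sum_{d=1}^{n} A_{di}A_{ij}A_{jd}\right)^2}{p_i\, q_{j|i}} \qquad (5)$$

and

$$\left(E(\beta_t)\right)^2 = \frac{1}{36}\left(\mathrm{Tr}(A^3)\right)^2 \qquad (6)$$

Therefore

$$\mathrm{Var}(\beta_t) = \frac{1}{36}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\left(\sum_{d=1}^{n} A_{di}A_{ij}A_{jd}\right)^2}{p_i\, q_{j|i}} - \left(\mathrm{Tr}(A^3)\right)^2\right) \qquad (7)$$

Since $\beta$ is the average of $s$ independent copies of $\beta_t$,

$$\mathrm{Var}(\beta) = \frac{1}{s}\mathrm{Var}(\beta_t) = \frac{1}{36s}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\left(\sum_{d=1}^{n} A_{di}A_{ij}A_{jd}\right)^2}{p_i\, q_{j|i}} - \left(\mathrm{Tr}(A^3)\right)^2\right) \qquad (8)$$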
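
For small graphs, the mean and variance of a single estimate can be computed exactly by enumerating every pair $(i, j)$, which gives a quick way to sanity-check the two lemmas for a concrete choice of probabilities. The sketch below is illustrative only; the function name `estimator_moments` and the input layout (`p[i]` for $p_i$, `q[i][j]` for $q_{j|i}$) are assumptions made here.

```python
def estimator_moments(adj, p, q):
    """Exact E(beta_t) and Var(beta_t) by enumeration over all pairs (i, j).

    p[i] is the probability of selecting vertex i,
    q[i][j] is the probability of selecting j after i was selected.
    """
    n = len(adj)
    mean = 0.0
    second = 0.0
    for i in range(n):
        for j in range(n):
            prob = p[i] * q[i][j]
            if prob == 0:
                continue  # this pair is never drawn
            walks = sum(adj[d][i] * adj[i][j] * adj[j][d] for d in range(n))
            beta = walks / (6.0 * prob)
            mean += prob * beta
            second += prob * beta * beta
    return mean, second - mean * mean


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
adj = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
p = [0.25] * 4                                       # uniform vertex
q = [[a / sum(row) for a in row] for row in adj]     # uniform neighbour
mean, var = estimator_moments(adj, p, q)
```

Here the mean equals the true count of 1 triangle, and the variance equals $(56 - 36)/36 = 5/9$, in agreement with the variance expression evaluated by hand for this graph.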

For a vertex $i$ of the graph, the local triangles of $i$ are the triangles which are incident to $i$. The number of local triangles of $i$, denoted by $\Delta_i$, is equal to

$$\Delta_i = \frac{1}{2}\sum_{j=1}^{n}\sum_{d=1}^{n} A_{di}A_{ij}A_{jd}$$

Let $\{i,j\}$ be an edge of the graph. The local triangles of $\{i,j\}$ are the triangles for which $\{i,j\}$ is an edge. The number of local triangles of $\{i,j\}$, denoted by $\Delta_{\{i,j\}}$, is equal to

$$\Delta_{\{i,j\}} = \sum_{d=1}^{n} A_{di}A_{ij}A_{jd}$$

If $i$ is not connected to $j$, the number of local triangles of $\{i,j\}$ is equal to 0. The following relation holds between $\Delta_i$ and $\Delta_{\{i,j\}}$: $\Delta_i = \frac{1}{2}\sum_{j=1}^{n}\Delta_{\{i,j\}}$. Let $\Delta$ refer to the number of triangles in the graph. We have: $\Delta = \frac{1}{3}\sum_{i=1}^{n}\Delta_i$.
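
The three quantities just defined (triangles per edge, per vertex, and in total) can be computed together, which also makes the factors 1/2 and 1/3 in the identities concrete. The following is a brute-force sketch for small dense matrices; the name `local_triangle_counts` is illustrative.

```python
def local_triangle_counts(adj):
    """Return (per_edge, per_vertex, total) triangle counts.

    per_edge[i][j] -- local triangles of the edge {i, j} (0 if not an edge)
    per_vertex[i]  -- local triangles of vertex i
    total          -- number of triangles in the graph
    """
    n = len(adj)
    per_edge = [[sum(adj[d][i] * adj[i][j] * adj[j][d] for d in range(n))
                 for j in range(n)] for i in range(n)]
    # Each triangle at i is seen through both of its edges incident to i.
    per_vertex = [sum(row) // 2 for row in per_edge]
    # Each triangle is counted once at each of its three vertices.
    total = sum(per_vertex) // 3
    return per_edge, per_vertex, total


# Two triangles {0,1,2} and {1,2,3} sharing the edge {1,2}.
diamond = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]]
per_edge, per_vertex, total = local_triangle_counts(diamond)
```

For this graph the shared edge {1, 2} has two local triangles, the end vertices 0 and 3 have one each, and the total is 2.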

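Using the notion of local triangles of edges, and since $\mathrm{Tr}(A^3) = 6\Delta$, Equation 8 can be rewritten as:

$$\mathrm{Var}(\beta) = \frac{1}{36s}\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\Delta_{\{i,j\}}^2}{p_i\, q_{j|i}} - \frac{\Delta^2}{s} \qquad (9)$$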

Lemma 3. If in Algorithm 1 the probabilities $p_1, p_2, \ldots, p_n$ and $q_{1|i}, q_{2|i}, \ldots, q_{n|i}$ are accessible in constant time, its time complexity will be $O(sn)$.

Proof. The loop in Lines 3-10 is performed $s$ times. Inside the loop, the most time-consuming step is Line 8, which takes $O(n)$ time. Therefore, the time complexity of the algorithm is $O(sn)$. ■



3 Sampling methods 

In this section, we present a number of sampling methods. First, in Section 3.1, the optimal sampling is investigated. Then, since the optimal sampling might be computationally expensive, other near-optimal samplings are introduced.

3.1 Optimal sampling 

Lemma 4 introduces the probabilities and error bound of the optimal sampling.

Lemma 4. If for $1 \le i \le n$, the $p_i$'s are equal to

$$p_i = \frac{\Delta_i}{3\Delta} \qquad (10)$$

and then, after selecting $i$, for $1 \le j \le n$, the $q_{j|i}$'s are

$$q_{j|i} = \frac{\Delta_{\{i,j\}}}{2\Delta_i} \qquad (11)$$

the variance of $\beta$ presented in Equation 8 is minimized. The minimized variance is 0.

Since the error bound of the optimal sampling is 0, it gives an exact triangle counting algorithm. On the other hand, the time complexity of computing the $p_i$'s in Equation 10 is the same as the time complexity of exact triangle counting. Therefore, using Algorithm 1 with the optimal sampling is the same (in both accuracy and complexity) as using an exact triangle counting algorithm.
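
The zero-variance claim can be illustrated numerically: if the vertex is drawn with probability proportional to its number of local triangles, and the second vertex with probability proportional to the local triangles of the edge, every pair that can be drawn yields the same estimate. The sketch below assumes $p_i = \Delta_i / (3\Delta)$ and $q_{j|i} = \Delta_{\{i,j\}} / (2\Delta_i)$ and uses an illustrative function name.

```python
def optimal_sampling_estimates(adj):
    """Return (Delta, list of estimates, one per pair (i, j) that can be drawn)."""
    n = len(adj)
    per_edge = [[sum(adj[d][i] * adj[i][j] * adj[j][d] for d in range(n))
                 for j in range(n)] for i in range(n)]
    per_vertex = [sum(row) / 2.0 for row in per_edge]
    total = sum(per_vertex) / 3.0
    estimates = []
    for i in range(n):
        for j in range(n):
            if per_edge[i][j] == 0:
                continue  # probability zero under the optimal sampling
            p_i = per_vertex[i] / (3.0 * total)
            q_ji = per_edge[i][j] / (2.0 * per_vertex[i])
            estimates.append(per_edge[i][j] / (6.0 * p_i * q_ji))
    return total, estimates


# Two triangles sharing an edge: edges have different local counts.
diamond = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]]
total, estimates = optimal_sampling_estimates(diamond)
```

Every entry of `estimates` equals the true count (2 for this graph) up to rounding, which is why the variance vanishes; the catch, as noted above, is that computing these probabilities already requires the exact local counts.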

3.2 q-optimal sampling

In the q-optimal sampling, every vertex $i$, $1 \le i \le n$, is selected by some strategy (which can be performed in $O(1)$ time). For example, vertices are selected uniformly at random (therefore $p_i = \frac{1}{n}$), or they are selected proportional to their degrees (therefore $p_i = \frac{\deg(i)}{2m}$, where $\deg(i)$ refers to the degree of $i$). Then, every vertex $j$ is selected in a way that minimizes $\mathrm{Var}(\beta)$.

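Lemma 5. In the q-optimal sampling, after choosing a vertex $i$, if every vertex $j$ is selected with probability

$$q_{j|i} = \frac{\Delta_{\{i,j\}}}{2\Delta_i} \qquad (12)$$

the variance of $\beta$ is minimized.

Proof. In order to minimize $\mathrm{Var}(\beta)$, we need to minimize $\sum_{j=1}^{n}\frac{\Delta_{\{i,j\}}^2}{q_{j|i}}$, because the other parts of $\mathrm{Var}(\beta)$ are independent of $j$. We define

$$f(q_{1|i}, q_{2|i}, \ldots, q_{n|i}) = \sum_{j=1}^{n}\frac{\Delta_{\{i,j\}}^2}{q_{j|i}}$$

and substitute $q_{n|i}$ by $1 - \sum_{j=1}^{n-1} q_{j|i}$ in $f$ and form the equations $\frac{\partial f}{\partial q_{j|i}} = 0$, for $1 \le j \le n-1$. We get

$$-\frac{\Delta_{\{i,j\}}^2}{q_{j|i}^2} + \frac{\Delta_{\{i,n\}}^2}{\left(1 - \sum_{j'=1}^{n-1} q_{j'|i}\right)^2} = 0 \;\Rightarrow\; q_{j|i} = \frac{\Delta_{\{i,j\}}\left(1 - \sum_{j'=1}^{n-1} q_{j'|i}\right)}{\Delta_{\{i,n\}}} \qquad (13)$$

Summing the $q_{j|i}$'s, for $1 \le j \le n-1$, and simplifying, we get

$$\sum_{j=1}^{n-1} q_{j|i} = \frac{\sum_{j=1}^{n-1}\Delta_{\{i,j\}}}{\sum_{j=1}^{n}\Delta_{\{i,j\}}}$$

Putting the value of $\sum_{j=1}^{n-1} q_{j|i}$ into Equation 13 and simplifying, we get the value of $q_{j|i}$ for the q-optimal sampling:

$$q_{j|i} = \frac{\Delta_{\{i,j\}}}{\sum_{j'=1}^{n}\Delta_{\{i,j'\}}} = \frac{\Delta_{\{i,j\}}}{2\Delta_i}$$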

If vertices $i$ are selected proportional to their degrees, the variance of $\beta$ in the q-optimal sampling will be

$$\mathrm{Var}(\beta) = \frac{2m}{9s}\sum_{i=1}^{n}\frac{\Delta_i^2}{\deg(i)} - \frac{\Delta^2}{s} \qquad (14)$$

and if they are selected uniformly at random, $\mathrm{Var}(\beta)$ will be

$$\mathrm{Var}(\beta) = \frac{n}{9s}\sum_{i=1}^{n}\Delta_i^2 - \frac{\Delta^2}{s} \qquad (15)$$

The motivation for selecting vertices proportional to their degrees is that, as studied in [15], in most real-world networks, vertices of higher degree have a higher number of local triangles. Then, for every two vertices $i$ and $i'$, if it holds that $\deg(i) \ge \deg(i')$ implies $\Delta_i \ge \Delta_{i'}$, it can be shown that selecting vertices proportional to their degrees gives a better sampling than choosing them uniformly at random.

If the q-optimal sampling is used, in every iteration of the loop in Lines 3-10 of Algorithm 1, the probabilities $p_i$ and $q_{j|i}$ can be computed in $O(m)$ time. On the other hand, it takes $O(n)$ time to compute $\beta_t$. Therefore, the time complexity of Algorithm 1 with the q-optimal sampling will be $O(sm)$.
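
A possible rendering of the q-optimal sampling with uniformly chosen vertices is sketched below. It assumes the graph is stored as a dictionary mapping each vertex to the set of its neighbours (an input format chosen here for convenience, not prescribed by the paper), and it computes the per-edge triangle counts around the sampled vertex by set intersection.

```python
import random


def q_optimal_estimate(adj_sets, s, rng=None):
    """Estimate the triangle count with p_i = 1/n and q_{j|i} = Delta_{i,j} / (2 Delta_i).

    adj_sets -- dict: vertex -> set of neighbouring vertices (assumed format)
    """
    rng = rng or random.Random()
    vertices = list(adj_sets)
    n = len(vertices)
    total = 0.0
    for _ in range(s):
        i = rng.choice(vertices)
        # Local triangles of each edge {i, j}: common neighbours of i and j.
        per_edge = {j: len(adj_sets[i] & adj_sets[j]) for j in adj_sets[i]}
        two_delta_i = sum(per_edge.values())      # equals 2 * Delta_i
        if two_delta_i == 0:
            continue                              # this sample estimates 0
        candidates = list(per_edge)
        j = rng.choices(candidates,
                        weights=[per_edge[c] for c in candidates])[0]
        q = per_edge[j] / two_delta_i
        total += per_edge[j] / (6.0 * (1.0 / n) * q)
    return total / s
```

With these probabilities the estimate of a single iteration simplifies to $n\Delta_i/3$, so the choice of $j$ no longer affects the value; the code keeps the general form only to mirror the algorithm.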

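The q-optimal sampling can provide efficient $1 \pm \epsilon$ approximations, specifically if some information on the local triangles of vertices is available. For example, consider the version of the q-optimal sampling where vertices $i$ are selected uniformly at random ($p_i = \frac{1}{n}$). Suppose that there exists an already known value $\tilde{\Delta}^v$ such that for every vertex $i$, $\Delta_i \le \tilde{\Delta}^v$. Then, in every iteration $t$ of the loop in Lines 3-10 of Algorithm 1, a random variable $X_t$ can be defined as $X_t = \frac{3\beta_t}{n\tilde{\Delta}^v} = \frac{\Delta_{i(t)}}{\tilde{\Delta}^v}$, where $\Delta_{i(t)}$ is the number of local triangles of the vertex selected in iteration $t$. We have: $E(X_t) = \frac{3\Delta}{n\tilde{\Delta}^v} = \frac{\bar{\Delta}^v}{\tilde{\Delta}^v}$, where $\bar{\Delta}^v$ is the average number of local triangles of vertices. By the Chernoff bound we obtain:

$$\Pr\left[\left|\frac{1}{s}\sum_{t=1}^{s} X_t - \frac{\bar{\Delta}^v}{\tilde{\Delta}^v}\right| > \epsilon\,\frac{\bar{\Delta}^v}{\tilde{\Delta}^v}\right] < \exp\left(-\frac{\epsilon^2 s\,\bar{\Delta}^v}{2\tilde{\Delta}^v}\right) \qquad (16)$$

If $s = \Omega\left(\frac{\tilde{\Delta}^v \log n}{\bar{\Delta}^v \epsilon^2}\right)$, then $\frac{n\tilde{\Delta}^v}{3s}\sum_{t=1}^{s} X_t$ approximates $\Delta$ within a factor of $\epsilon$ with probability at least $1 - n^{-c}$ for any constant $c$. This gives an $O\left(\frac{\tilde{\Delta}^v m \log n}{\bar{\Delta}^v \epsilon^2}\right)$ time algorithm which approximates $\Delta$ within a factor of $\epsilon$. Specifically, if $\tilde{\Delta}^v$ is greater than $\bar{\Delta}^v$ only by a constant factor, the time complexity of the algorithm will be $O\left(\frac{m \log n}{\epsilon^2}\right)$.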
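
To see what the asymptotic sample bound means in numbers, one can solve the tail bound for $s$. The sketch below assumes the bound has the form $\exp(-\epsilon^2 s\, r / 2)$, where $r$ is the ratio of the average to the upper bound on local triangle counts, and asks for a failure probability of at most $n^{-c}$. The function name and the explicit constant are illustrative assumptions; the analysis itself only states the order of growth.

```python
import math


def samples_needed(n, eps, upper_bound, average, c=1.0):
    """Smallest s with exp(-eps^2 * s * average / (2 * upper_bound)) <= n ** (-c).

    upper_bound -- known upper bound on the local triangle counts
    average     -- average local triangle count (assumed known or estimated)
    """
    if not (eps > 0 and 0 < average <= upper_bound and n > 1):
        raise ValueError("need eps > 0, 0 < average <= upper_bound and n > 1")
    return math.ceil(2.0 * c * upper_bound * math.log(n) / (eps ** 2 * average))


s = samples_needed(n=1000, eps=0.1, upper_bound=50, average=10)
```

With $n = 1000$, $\epsilon = 0.1$ and an upper bound five times the average, this gives 6908 samples; the number grows linearly in the ratio between the upper bound and the average, which is why a tight upper bound matters.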



3.3 Edge sampling 

In the edge sampling, first a vertex $i$ is selected by some strategy (which can be done in $O(1)$ time). Then, a neighbor $j$ of $i$ is selected by some (possibly different) strategy which also can be done in $O(1)$ time. Since in this sampling the computation of probabilities is done in $O(1)$ time, the time complexity of the algorithm is $O(sn)$.

For example, $i$ can be selected uniformly at random (therefore, $p_i = \frac{1}{n}$); then, for every $j$, if $j$ is a neighbor of $i$, $q_{j|i}$ is equal to $\frac{1}{\deg(i)}$; otherwise it is 0. This case is similar to the methods which uniformly sample an edge, count the number of triangles incident to it, and scale the result. The first algorithm proposed in [12] and, partially, the algorithm of [5] are examples of such methods. In this case, the variance of $\beta$ will be

$$\mathrm{Var}(\beta) = \frac{n}{36s}\sum_{i=1}^{n}\left(\deg(i)\sum_{j=1}^{n}\Delta_{\{i,j\}}^2\right) - \frac{\Delta^2}{s} \qquad (17)$$

In the second case of the edge sampling, $i$ is chosen with probability $p_i = \frac{\deg(i)}{2m}$. Then, similar to the first case, for every vertex $j$, if $j$ is a neighbor of $i$, $q_{j|i}$ will be $\frac{1}{\deg(i)}$. Otherwise, it will be 0. The variance of $\beta$ in this case is:

$$\mathrm{Var}(\beta) = \frac{m}{18s}\sum_{i=1}^{n}\sum_{j=1}^{n}\Delta_{\{i,j\}}^2 - \frac{\Delta^2}{s} \qquad (18)$$
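
The second case of the edge sampling can be sketched as follows. Choosing $i$ with probability $\deg(i)/(2m)$ and then a uniform neighbour $j$ selects every ordered pair of adjacent vertices with probability $1/(2m)$, so the sketch simply draws a uniform edge from an edge list. The input format (a duplicate-free list of undirected edges on vertices `0..n-1`) and the function name are assumptions made for this illustration, and common neighbours are found by set intersection rather than by an $O(n)$ scan.

```python
import random


def edge_sampling_estimate(edges, n, s, rng=None):
    """Estimate the triangle count from s uniformly sampled edges.

    edges -- list of undirected edges (u, v), each listed once (assumed format)
    n     -- number of vertices, labelled 0..n-1
    """
    rng = rng or random.Random()
    neighbours = [set() for _ in range(n)]
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    m = len(edges)
    total = 0.0
    for _ in range(s):
        u, v = rng.choice(edges)
        local = len(neighbours[u] & neighbours[v])   # local triangles of {u, v}
        # beta_t = Delta_{u,v} / (6 * p_i * q_{j|i}) with p_i * q_{j|i} = 1 / (2m)
        total += local * m / 3.0
    return total / s
```

Each sample costs time proportional to the smaller of the two degrees here; the per-sample estimate is the edge's local triangle count scaled by $m/3$, since each triangle is shared by three edges.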

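Similar to the q-optimal sampling, the edge sampling can provide efficient $1 \pm \epsilon$ approximations, specifically if some information on the local triangles of edges is available. For example, consider the second case and suppose that there exists a known value $\tilde{\Delta}^e$ such that for every edge $\{i,j\}$, $\Delta_{\{i,j\}} \le \tilde{\Delta}^e$. Then, in every iteration $t$ of the loop in Lines 3-10 of Algorithm 1, a random variable $X_t$ can be defined as $X_t = \frac{3\beta_t}{m\tilde{\Delta}^e} = \frac{\Delta_{\{i,j\}(t)}}{\tilde{\Delta}^e}$, where $\Delta_{\{i,j\}(t)}$ is the number of local triangles of the edge $\{i,j\}$ selected in iteration $t$. We have: $E(X_t) = \frac{3\Delta}{m\tilde{\Delta}^e} = \frac{\bar{\Delta}^e}{\tilde{\Delta}^e}$, where $\bar{\Delta}^e$ is the average number of local triangles of edges. Similarly, by the Chernoff bound we obtain:

$$\Pr\left[\left|\frac{1}{s}\sum_{t=1}^{s} X_t - \frac{\bar{\Delta}^e}{\tilde{\Delta}^e}\right| > \epsilon\,\frac{\bar{\Delta}^e}{\tilde{\Delta}^e}\right] < \exp\left(-\frac{\epsilon^2 s\,\bar{\Delta}^e}{2\tilde{\Delta}^e}\right) \qquad (19)$$

and if $s = \Omega\left(\frac{\tilde{\Delta}^e \log n}{\bar{\Delta}^e \epsilon^2}\right)$, then $\frac{m\tilde{\Delta}^e}{3s}\sum_{t=1}^{s} X_t$ approximates $\Delta$ within a factor of $\epsilon$ with probability at least $1 - n^{-c}$ for any constant $c$. This gives an $O\left(\frac{\tilde{\Delta}^e n \log n}{\bar{\Delta}^e \epsilon^2}\right)$ time algorithm which approximates $\Delta$ within a factor of $\epsilon$. Specifically, if $\tilde{\Delta}^e$ is greater than $\bar{\Delta}^e$ only by a constant factor, the time complexity of the algorithm will be $O\left(\frac{n \log n}{\epsilon^2}\right)$.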

4 Triangle counting in streams 

In many applications, like the World Wide Web and social networks, the dataset is too large to be loaded into main memory. A widely used approach to address this problem is the data stream model. A data stream is an ordered sequence in which data arrives one item at a time, and the algorithm has access to limited computation and storage capabilities.

In this section, we extend the algorithm presented in Section 2 to streams. While our focus is the q-optimal sampling where vertices $i$ are selected uniformly at random, the other sampling methods can be extended to streams in a similar way; due to lack of space, we omit the details here. If the number of vertices of the graph, $n$, is already known, our algorithm needs 2 passes over the data. Otherwise, an extra pass is needed to find it. First, $s$ integers (vertices) $i$ are selected independently at random with uniform probability $\frac{1}{n}$. Then:

- During the first pass, for every $i$, the neighborhood vector of $i$, denoted by $\mathcal{I}^i$, is formed. The neighborhood vector of $i$ shows which vertices of the graph are neighbors of $i$ and which ones are not. For every vertex $j$, if $\{i,j\}$ is an edge of the graph, $\mathcal{I}^i_j$ is set to 1; otherwise, it is set to 0.

- During the second pass, for every sampled vertex $i$ and for all vertices $j$, the number of local triangles of the edge between $i$ and $j$ is calculated. To do so, for every $i$ a vector $V^i$ of size $n$ is used, where $V^i_j$ stores the number of local triangles of $\{i,j\}$. When an edge $\{j,d\}$ is visited, if there exists an edge between $i$ and $j$ (i.e., $\mathcal{I}^i_j = 1$) and an edge between $i$ and $d$ (i.e., $\mathcal{I}^i_d = 1$), a triangle consisting of the vertices $i$, $j$ and $d$ is found. Therefore, $V^i_j$, $V^i_d$ and $z^i$ are increased by 1. $z^i$ is used to store the number of local triangles of $i$.

During these two passes, the information required for the q-optimal sampling and approximate triangle counting is gathered. Then, for every $i$, a $j \in \{1, \ldots, n\}$ is selected with probability $q_{j|i} = \frac{\Delta_{\{i,j\}}}{2\Delta_i} = \frac{V^i_j}{2z^i}$ and the approximate number of triangles is calculated.
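
The two passes described above can be sketched as follows. The stream is modelled here as a zero-argument callable that returns a fresh iterator over the edges, which stands in for one pass over the data; this interface, the function name, and the assumption that every edge appears exactly once per pass are choices made for this illustration. The inner loop over sampled vertices does $O(s)$ work per edge in this simple form.

```python
import random


def stream_triangle_estimate(stream, n, s, rng=None):
    """Two-pass estimate with uniformly sampled vertices and q-optimal choice of j.

    stream -- callable returning an iterator over edges (u, v), vertices 0..n-1
    """
    rng = rng or random.Random()
    sampled = [rng.randrange(n) for _ in range(s)]
    distinct = set(sampled)

    # Pass 1: neighbourhood vector of every sampled vertex.
    nbr = {i: [0] * n for i in distinct}
    for u, v in stream():
        if u in nbr:
            nbr[u][v] = 1
        if v in nbr:
            nbr[v][u] = 1

    # Pass 2: V[i][j] counts local triangles of {i, j}; z[i] those of i.
    V = {i: [0] * n for i in distinct}
    z = {i: 0 for i in distinct}
    for j, d in stream():
        for i in distinct:
            if nbr[i][j] and nbr[i][d]:
                V[i][j] += 1
                V[i][d] += 1
                z[i] += 1

    total = 0.0
    for i in sampled:
        if z[i] == 0:
            continue                      # this sample estimates 0
        j = rng.choices(range(n), weights=V[i])[0]
        q = V[i][j] / (2.0 * z[i])        # q_{j|i}
        total += V[i][j] / (6.0 * (1.0 / n) * q)
    return total / s
```

A list of edges can be replayed as a stream with `lambda: iter(edges)`. The memory used is dominated by the vectors kept for the sampled vertices, i.e. $O(sn)$ cells.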

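Every pass needs $O(sn)$ space in main memory. After passing twice over the stream and calculating the neighborhood vectors $\mathcal{I}^i$, the vectors $V^i$ and the counters $z^i$, the approximate number of triangles can be determined in $O(1)$ time. We can store the random variables $X_t$ if an upper bound $\tilde{\Delta}^v$ on the number of local triangles of every vertex is known. In this case, we will need $O\left(\frac{\tilde{\Delta}^v n \log n}{\bar{\Delta}^v \epsilon^2}\right)$ space to provide a $1 \pm \epsilon$ approximation, where $\bar{\Delta}^v$ is the average number of local triangles of vertices. Therefore, we can present the following theorem: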

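Theorem 1. There is a 2-pass algorithm (if $n$ is already known; otherwise it needs 3 passes) to count the number of triangles in a stream of edges which needs $O(sn)$ memory space and constant update time, and whose error guarantee obeys Equation 15. Specifically, with space usage $O\left(\frac{\tilde{\Delta}^v n \log n}{\bar{\Delta}^v \epsilon^2}\right)$, where $\tilde{\Delta}^v$ is an upper bound on and $\bar{\Delta}^v$ the average of the number of local triangles of vertices, it gives a $1 \pm \epsilon$ approximation.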

Similar results can be presented for other samplings. 

5 Related work 

Buriol et al. presented one of the first approximate triangle counting algorithms. In their method, an edge and a vertex are selected at random and it is checked whether they form a triangle or not. Then, the fraction of tests which form a triangle is scaled and returned as an estimation of the number of triangles. In this method, if

s = los ^{ ¥ 3 

independent trials are done, where $T_i$ is the number of triples of vertices with $i$ edges, then with probability at least $1 - \delta$, the estimated number of triangles is between $(1 - \epsilon)T_3$ and $(1 + \epsilon)T_3$. However, $T_0 + T_1 + T_2$ can be very large compared to $T_3$.

Another method, which is efficient only for triangle-dense graphs, is that of [2]. It is based on reducing triangle counting to estimating the zeroth, first and second frequency moments. The authors showed that there is a streaming algorithm that, for any adjacency stream of a graph, computes an $(\epsilon, \delta)$-approximation of the number of triangles using space:

$$O\left(\frac{1}{\epsilon^3} \times \log\frac{1}{\delta} \times \left(\frac{T_1 + T_2 + T_3}{T_3}\right)^3 \times \log n\right)$$

Tsourakakis IPT31 proposed a triangle counting algorithm based on computing the 
largest eigenvalues of the adjacency matrix of an undirected graph to approximate both 
the total as well as the local number of triangles in the graph. Tsourakakis's method exploits the following property: the total number of triangles in an undirected graph is $\frac{1}{6}\sum_{j=1}^{n}\lambda_j^3$, where $\lambda_j$ is the $j$-th eigenvalue of $A$. Using the Lanczos method [7], the eigenvalues are generated from the largest to the smallest. The computation stops when the smallest generated eigenvalue contributes very little to the total number of triangles. However, it is known that the time complexity of finding eigenvalues is the same as the time complexity of matrix multiplication. For approximating the number of local triangles of a vertex $i$, he exploited the following property: $\Delta_i = \frac{\sum_{j=1}^{n}\lambda_j^3 u_{ij}^2}{2}$, where $\lambda_j$ is the $j$-th eigenvalue and $u_{ij}$ is the $i$-th entry of the $j$-th eigenvector of $A$.

In [16], the authors proposed the DOULION algorithm for approximate triangle counting. In this method, every edge $e$ of the graph is removed with probability $1 - p$, where $p$ is the sparsification probability. If an edge survives, the weight $\frac{1}{p}$ is assigned to it. Then, the number of triangles in the original graph is approximated as the number of triangles in the sparsified graph times $\frac{1}{p^3}$. The following error bound was provided for the resulting estimate $\Delta'$:

$$\mathrm{Var}(\Delta') = \frac{\Delta(p^3 - p^6) + 2k(p^5 - p^6)}{p^6}$$

where $k$ is the number of pairs of triangles which are not edge-disjoint (i.e., which share an edge).
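
The sparsify-count-rescale idea can be sketched in a few lines. This is a simplified illustration of the scheme as described above, not the implementation of [16]: the function name, the edge-list input format (each undirected edge listed once, vertices `0..n-1`) and the exact counting routine used on the sparsified graph are assumptions made here.

```python
import random


def doulion_estimate(edges, n, p, rng=None):
    """Keep each edge with probability p, count triangles, scale by 1 / p^3."""
    rng = rng or random.Random()
    kept = [e for e in edges if rng.random() < p]
    neighbours = [set() for _ in range(n)]
    for u, v in kept:
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Summing common neighbours over edges sees each triangle three times.
    sparse_triangles = sum(len(neighbours[u] & neighbours[v]) for u, v in kept) // 3
    return sparse_triangles / p ** 3
```

A triangle survives only if all three of its edges are kept, which happens with probability $p^3$; this is where the scaling factor comes from. With `p = 1.0` nothing is removed and the function returns the exact count.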

The randomized algorithm of [14] colors the vertices of the graph with $N = \frac{1}{p}$ colors uniformly at random, counts triangles whose vertices have the same color, and scales that count appropriately. The authors showed that for sufficiently large values of $p$, their estimation of the number of triangles is concentrated around its expectation.
In |Q~3), the authors combined the sparsification technique and the idea of vertex partitioning (into high-degree and low-degree vertices) presented in [1]. They developed an

3. 

0(m + m 2 }°f " ) time (1 ± e) -approximation algorithm. 

In [3], the authors studied the problem of approximate local triangle counting in large graphs. Their approximation algorithms are based on the idea of min-wise independent permutations [4]. Their algorithms operate in a semi-streaming fashion, using $O(n)$ space in main memory and performing $O(\log n)$ passes over the edges of the graph.



6 Conclusions 

In this paper, we proposed a new randomized algorithm for approximate triangle counting. The algorithm provides a general framework which can be combined with different sampling techniques to give methods with different time complexities and error bounds. For example, it can be combined with the q-optimal sampling and the edge sampling, presented in the paper, to give $O(sm)$ and $O(sn)$ time algorithms for approximate triangle counting. We showed that if an upper bound $\tilde{\Delta}^e$ is known for the number of triangles incident to every edge, the proposed method provides a $1 \pm \epsilon$ approximation which runs in $O\left(\frac{\tilde{\Delta}^e n \log n}{\bar{\Delta}^e \epsilon^2}\right)$ time, where $\bar{\Delta}^e$ is the average number of triangles incident to an edge. Also, if an upper bound $\tilde{\Delta}^v$ is known for the number of triangles incident to every vertex, the proposed method provides a $1 \pm \epsilon$ approximation which runs in $O\left(\frac{\tilde{\Delta}^v m \log n}{\bar{\Delta}^v \epsilon^2}\right)$ time, where $\bar{\Delta}^v$ is the average number of triangles incident to a vertex.

Finally, we showed that the algorithm can be extended to streams. In that case it performs, for example, 2 passes over the data (3 passes if the size of the graph is not known in advance) and uses $O(sn)$ space.